Grounding (i.e., localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground-truth spatial localization of phrases, so it is desirable to learn from data with little or no grounding supervision. We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training, our approach encodes the phrase using a recurrent network language model and then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the correct attention, i.e., the grounding, is evaluated. If grounding supervision is available, it can be applied directly via a loss over the attention mechanism. We demonstrate the effectiveness of our approach on the Flickr 30k Entities and ReferItGame datasets with different levels of supervision, ranging from none through partial to full supervision. Our supervised variant improves by a large margin over the state of the art on both datasets.
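To illustrate the core idea, the following is a minimal NumPy sketch of attend-then-reconstruct grounding, not the paper's implementation: attention weights over candidate image regions are computed from a phrase embedding, a weighted region feature is formed, and the phrase embedding is reconstructed from it. All names (`W_att`, `W_rec`, dimensions) are hypothetical placeholders; a real system would learn these parameters by minimizing the reconstruction loss, and the attention weights themselves serve as the grounding.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over region scores
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_reconstruct(phrase_vec, region_feats, W_att, W_rec):
    """Latent attention over regions, then phrase reconstruction.

    phrase_vec:   (d_p,)  encoded phrase (e.g. from an RNN language model)
    region_feats: (n, d_r) features of n candidate image regions
    W_att:        (d_r, d_p) hypothetical attention parameters
    W_rec:        (d_p, d_r) hypothetical reconstruction parameters
    """
    scores = region_feats @ W_att @ phrase_vec  # one score per region
    alpha = softmax(scores)                     # attention = soft grounding
    attended = alpha @ region_feats             # attention-weighted region feature
    recon = W_rec @ attended                    # reconstructed phrase embedding
    return alpha, recon

# Toy example with random parameters (untrained, for shape checking only)
rng = np.random.default_rng(0)
d_p, d_r, n = 8, 16, 5
phrase = rng.standard_normal(d_p)
regions = rng.standard_normal((n, d_r))
W_att = rng.standard_normal((d_r, d_p))
W_rec = rng.standard_normal((d_p, d_r))

alpha, recon = attend_and_reconstruct(phrase, regions, W_att, W_rec)
# alpha is a distribution over the n regions; argmax gives the predicted grounding
```

At training time only the reconstruction error between `recon` and `phrase` would be penalized (the attention stays latent); with grounding supervision, an additional loss can be placed directly on `alpha`.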